CS4132 Data Analytics
Books have been with us since the beginning of recorded history, playing a key role in preserving events and ideas for later generations. As society has developed, books have evolved alongside it. Gone are the days when the only way to find a book was to look for it in a library or a store: books now exist in many forms online, pushed primarily by one major distributor, Amazon.
Amazon allows anyone to access books online, and even to publish books themselves without any outside help. But this has led to the problem of information overload. While the 'over 80,000 books' claimed on the front page is not incorrect, it understates the true scale of the catalogue: over 30 million books were reportedly available for purchase in 2014 alone. One can only imagine how much this number has grown in the 8 years since this information stopped being publicly available.
With all this in mind, it is easy to see how difficult it is to stand out in the sea of available books. In this project, I will analyse the factors that contribute to a book's success and allow it to overcome this statistical hurdle.
What are the factors that contribute to a book's popularity?
Which genres are the most popular?
In general, over the years, Literature & Fiction and Children's Books have been the most popular genres; following them is a cluster of genres very close to one another in popularity.
Publishing in which genre is most likely to lead to the most popularity?
The brackets refer to the file(s) in question in the appendix
index - This is a list of the main categories that I am interested in. It consists of all the main categories available minus Calendars, as they serve a different purpose than what I am investigating, and Christian Books & Bibles, as that category and all its subcategories are also accessible through Religion & Spirituality, which I have searched.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
from scipy import stats
from tqdm.auto import tqdm
import plotly.express as px
tqdm.pandas()
index = ['Arts & Photography',
'Biographies & Memoirs',
'Business & Money',
"Children's Books",
'Comics & Graphic Novels',
'Computers & Technology',
'Cookbooks, Food & Wine',
'Crafts, Hobbies & Home',
'Education & Teaching',
'Engineering & Transportation',
'Health, Fitness & Dieting',
'History',
'Humor & Entertainment',
'Law',
'LGBTQ+ Books',
'Literature & Fiction',
'Medical Books',
'Mystery, Thriller & Suspense',
'Parenting & Relationships',
'Politics & Social Sciences',
'Reference',
'Religion & Spirituality',
'Romance',
'Science & Math',
'Science Fiction & Fantasy',
'Self-Help',
'Sports & Outdoors',
'Teen & Young Adult',
'Test Preparation',
'Travel']
df_all - Contains the book URLs stored by genre (deepest subgenre), with each column storing a book. Unwanted main categories were removed in the webscraping process. The first row contains the estimate that Amazon gives of the number of books in the category. However, given how far off the 'over 80,000' estimate on the main page is (Dataset 1.), I do not believe these are good estimates of the number of books in a category, and the row is thus removed in data cleaning.
df_all_list = []
for a in index:
    df_books = pd.read_csv('Booklist data/' + a + '.csv', index_col = [0,1,2,3,4])
    df_all_list.append(df_books)
df_all = pd.concat(df_all_list, keys = index, axis=0)
df_all.head()
| Unnamed: 5 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | Unnamed: 166 | 1201 | 1202 | 1203 | 1204 | 1205 | 1206 | 1207 | 1208 | 1209 | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sub_Cat | Sub1_Cat | Sub2_Cat | Sub3_Cat | Sub4_Cat | ||||||||||||||||||||||
| Arts & Photography | Architecture | Buildings | Landmarks & Monuments | --- | --- | 0 | over 5,000 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | /Architecture-Visual-History-Jonathan-Glancey/... | /99-Invisible-City-Hidden-Everyday/dp/03581266... | /Abandoned-Kentucky-Sherman-Cahal/dp/177276171... | /Unique-America-Strange-Unusual-Through/dp/168... | /Understanding-Cemetery-Symbols-Historic-Grave... | /Noble-Ambitions-English-Country-House/dp/1541... | /Building-Chicago-Architectural-John-Zukowsky/... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Religious Buildings | --- | --- | 0 | over 5,000 | /Brunelleschis-Dome-Renaissance-Reinvented-Arc... | /Churches-Paris-Peggy-Shannon/dp/1788841018/re... | /Understanding-Cemetery-Symbols-Historic-Grave... | /Memento-Mori-Dead-Among-Us/dp/0500252610/ref=... | /Mosques-Splendors-Islam-Leyla-Uluhanli/dp/084... | /Cathedral-Story-Construction-David-Macaulay/d... | /Cathedrals-Masterpieces-Architecture-Feats-En... | /Shinto-Kami-Way-Sokyo-Ph-D/dp/0804835578/ref=... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
| Residential | --- | --- | 0 | over 10,000 | /Pretty-Good-House-Michael-Maines/dp/164155165... | /Field-Guide-American-Houses-Revised/dp/037571... | /Very-Efficient-Carpenter-Residential-Construc... | /Parisian-Design-Interiors-David-Jimenez/dp/08... | /Suzanne-Kasler-Edited-Style/dp/0847872521/ref... | /Oasis-Modern-Desert-Homes-Around/dp/052557515... | /Case-Study-Houses-Basic-Art/dp/3836535602/ref... | /Arriving-Home-Gracious-Southern-Welcome/dp/14... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
| Criticism | --- | --- | --- | 0 | over 7,000 | /Architecture-Francis-D-K-Ching/dp/1118745086/... | /Pattern-Language-Buildings-Construction-Envir... | /Global-History-Architecture-Francis-Ching/dp/... | /Airbnb-Business-Date-Hard-Find-ebook/dp/B0B52... | /Architecture-Students-Handbook-Professional-P... | /Understanding-Architecture-Elements-History-M... | /Poetics-Space-Gaston-Bachelard/dp/0143107526/... | /Buildings-Across-Time-Introduction-Architectu... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ||
| Decoration & Ornament | --- | --- | --- | 0 | over 6,000 | /Finer-Things-Timeless-Furniture-Textiles/dp/0... | /At-Home-Evocative-Art-Forward-Interiors/dp/14... | /Cottage-Fairy-Companion-Cottagecore-Connectin... | /Arriving-Home-Gracious-Southern-Welcome/dp/14... | /Cabin-Porn-Inspiration-Quiet-Somewhere/dp/031... | /Perfectly-Imperfect-Home-Decorate-Live/dp/030... | /Cozy-Minimalist-Home-Style-Stuff/dp/031035091... | /Welcome-Home-Minimalist-Decorating-Hosting/dp... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 1339 columns
df_bookinfo contains all the book data of the scraped books. As there are many columns which are not of use, the explanation for each column will be in the data cleaning section, after the data has been cleaned.
The data is structured such that if a book has multiple formats (e.g. Hardcover, Kindle, Paperback, Audiobook), then each row contains a separate format of the same book. The original scraped page is tracked through the Book Url column.
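The one-row-per-format layout can be illustrated with a minimal sketch (toy rows with hypothetical URLs, not the actual scraped data): rows for different formats of one book share the same 'Book Url', so grouping by that column recovers the one-page-per-book view.

```python
import pandas as pd

# Toy illustration: one scraped page yields one row per available format,
# all sharing the same 'Book Url' (URLs here are made up).
toy = pd.DataFrame({
    'Book Url': ['/Example-Book/dp/1'] * 3 + ['/Other-Book/dp/2'],
    'Format':   ['Hardcover', 'Kindle', 'Audiobook', 'Paperback'],
})

# Grouping by 'Book Url' counts how many formats each book was scraped in.
formats_per_book = toy.groupby('Book Url')['Format'].nunique()
print(formats_per_book)
```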
lst = [-1,2999,5999,8999,11999,12999,13999,14999,17999,20999,21666]
df_list = []
for a in range(len(lst) - 1):
    df_info = pd.read_csv(f'Bookinfo data/{lst[a] + 1}_to_{lst[a+1]}.csv')
    df_info = df_info.iloc[:,1:]
    df_list.append(df_info)
df_bookinfo = pd.concat(df_list, axis=0)
df_bookinfo.head()
| Title | ASIN | Publisher | Publication date | Language | File size | Text-to-Speech | Enhanced typesetting | X-Ray | Word Wise | ... | Turtleback | Workbook | #2,058 in Teen & Young Adult Literature & Fiction | #262 in Teen & Young Adult Christian Fantasy | #2,970 in Teen & Young Adult Social & Family Issue Fiction | #121 in Teen & Young Adult Fairy Tale & Folklore Adaptations | #154 in Budget Travel Guides | #45,072 in Memoirs (Books) | Unknown Binding | #703 in Murder & Mayhem True Accounts | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manmade Wonders of the World | B07YBZ2WSK | DK (October 8, 2019) | October 8, 2019 | English | 262248 KB | Not enabled | Not Enabled | Not Enabled | Not Enabled | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Man-Made Wonders of the World | NaN | DK; Illustrated edition (October 8, 2019) | NaN | English | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Architecture: A Visual History | NaN | DK; Illustrated edition (November 21, 2017) | NaN | English | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | The 99% Invisible City: A Field Guide to the H... | NaN | Dey Street Books; Illustrated edition (October... | NaN | English | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | The 99% Invisible City: A Field Guide to the H... | B086YHKSJX | Mariner Books; Illustrated edition (October 6,... | October 6, 2020 | English | 64102 KB | Enabled | Enabled | Enabled | Enabled | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 189 columns
df_reviews contains all of the review data for the scraped books. Reviews not from verified customers were filtered out of this dataframe. Some books have more than 5000 reviews, after which all further reviews are inaccessible. To still get a spread of data across varying dates, the reviews are sorted by featured instead of by date, so that the loss of data is somewhat random rather than cutting off all reviews submitted before a given date. To compensate for this loss, data calculated for popularity (the average rating weighted by the number of ratings) will be scaled up accordingly.
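The popularity metric and the compensation for the review cap can be sketched as below. The function names and the cap constant are illustrative assumptions, not the notebook's actual helpers.

```python
# A minimal sketch of the popularity metric described above: the average
# rating weighted by the number of ratings. When a book has more reviews
# than the accessible cap, quantities computed from the review sample can
# be scaled up by the ratio of total to accessible reviews. The cap value
# and helper names are assumptions for illustration.
REVIEW_CAP = 5000

def popularity(avg_rating: float, num_ratings: int) -> float:
    return avg_rating * num_ratings

def scale_factor(total_reviews: int) -> float:
    """Compensate for the review cap when totals exceed it."""
    return max(total_reviews / REVIEW_CAP, 1.0)

print(popularity(4.5, 418))   # 1881.0
print(scale_factor(12000))    # 2.4
```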
Additionally, there is a case where a book might not have received any reviews from verified customers. In such cases, the book will not appear in this dataset.
International reviews, i.e. reviews coming from outside the United States, are inaccessible for books affected by this cap, as they are always shown last. This unfortunately prevents us from investigating the popularity of all books internationally.
This dataset contains 7 columns, the explanations of which are below.
lst = [-1,999,1999,2999,3999,4999,5999,6999,7999,8999,9999,10999,13999,14999,15999,16999,17999,18999,19999,20999,21666]
df_list = []
for a in range(len(lst) - 1):
    df_review = pd.read_csv(f'Reviews data/{lst[a] + 1}_to_{lst[a+1]}.csv')
    df_review = df_review.iloc[:,1:]
    df_list.append(df_review)
df_reviews = pd.concat(df_list, axis=0)
df_reviews.head()
| Book URL | Num Ratings | Num Reviews | Name | Account ID | Rating | Date | |
|---|---|---|---|---|---|---|---|
| 0 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Kindle Customer | AERSA5VTXDB2WRV2RVWZ2MUDVDVQ | 4.0 out of 5 stars | Reviewed in the United States on September 24,... |
| 1 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Marshal Mueller | AH7TCO74B46RIFI53NFOOFXFTLFA | 5.0 out of 5 stars | Reviewed in the United States on February 6, 2022 |
| 2 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Hammad Iqbal | AGJIIPF56XA2AMEMXKYNPLYYRGQA | 5.0 out of 5 stars | Reviewed in the United States on November 2, 2021 |
| 3 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Edith Leet | AEW56N2K52FVCR7GKSKI3AWV64RA | 5.0 out of 5 stars | Reviewed in the United States on December 11, ... |
| 4 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Kiki | AFDPHQHRSOCGY73XICFL6QGG2IVQ | 4.0 out of 5 stars | Reviewed in the United States on July 8, 2021 |
Due to time constraints, I was unable to scrape all the books for each genre as I had set out to do. Instead, the books scraped are up to the first 8 books of each genre. This is reflected in df_searched.
The remaining books are reshaped into a single column to make looking up a book's genre easier. Additionally, all subgenres are removed, as they are beyond the scope of this investigation.
df_list = []
for a in range(1,9):
    df = df_all.loc[:,f'{a}']
    c = pd.DataFrame(df)
    c = c.rename(columns={f'{a}' : 'Book URL'})
    df_list.append(c)
df_searched = pd.concat(df_list, axis=0)
df_searched = df_searched.dropna(axis=0, how='any')
df_searched.index.names = ['Main Genre', 'Sub1 Genre', 'Sub2 Genre', 'Sub3 Genre', 'Sub4 Genre', 'Sub5 Genre']
df_searched.reset_index(inplace=True)
df_searched = df_searched[['Main Genre', 'Book URL']]
df_searched
| Main Genre | Book URL | |
|---|---|---|
| 0 | Arts & Photography | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... |
| 1 | Arts & Photography | /Brunelleschis-Dome-Renaissance-Reinvented-Arc... |
| 2 | Arts & Photography | /Pretty-Good-House-Michael-Maines/dp/164155165... |
| 3 | Arts & Photography | /Architecture-Francis-D-K-Ching/dp/1118745086/... |
| 4 | Arts & Photography | /Finer-Things-Timeless-Furniture-Textiles/dp/0... |
| ... | ... | ... |
| 45067 | Travel | /Rand-McNally-Folded-Map-Milwaukee/dp/05280251... |
| 45068 | Travel | /Landscapes-Wyoming-Yellowstone-National-Beaut... |
| 45069 | Travel | /Wyoming-Bucket-List-Adventure-Guide/dp/195759... |
| 45070 | Travel | /Fodors-Montana-Wyoming-Yellowstone-Full-color... |
| 45071 | Travel | /Hey-Kids-Lets-Journal-Washington/dp/194604908... |
45072 rows × 2 columns

Despite what the people who categorised this may think, this is not a book. There are several items like this, all of which are excluded in the webscraping process; the few that make it through are cleaned in the code below.
Next, these are some functions that I will apply to the dataframe for cleaning.
def combine_cols(row):
    for a in column_list:
        if isinstance(row[a], str):
            return row[a]

def get_publication_date(row):
    try:
        if len(row['Publisher'].split('(')) == 2:
            if row['Publisher'].split('(')[1].split(' ')[0] in ['January', 'February', 'March', 'April', 'May', 'June',
                                                               'July', 'August', 'September', 'October', 'November', 'December']:
                return row['Publisher'].split('(')[1][:-1]
    except:
        if isinstance(row['Publication date'], str):
            return row['Publication date']
        #elif isinstance(row['Audible.com Release Date'], str): #code never reaches here
        #    return row['Audible.com Release Date']

def clean_enabled(row):
    if isinstance(row[a], str): #a is the current column name, set globally by the loop that applies this function
        if (row[a] == 'Enabled'):
            return True #Cannot directly convert to boolean as the dtype has to be consistent
        else:
            return False

def reformat_weight(row):
    if isinstance(row['Item Weight'], str):
        comps = row['Item Weight'].split(' ')
        if comps[1] == 'pounds':
            return float(comps[0]) * 16
        else:
            return float(comps[0])

def reformat_time(row):
    if isinstance(row['Listening Length'], str):
        comps = row['Listening Length'].split(' ')
        if len(comps) == 5:
            return int(comps[0]) * 60 + int(comps[3])
        else: #comps is len 2
            if comps[1] == 'hours':
                return int(comps[0]) * 60
            else: #is minutes
                return int(comps[0])

def fix_ready(row):
    if row['Format'] == 'Audiobook':
        if row['Whispersync for Voice'] == 'Ready':
            return True
        else:
            return False
Amazon formats each book info page in different ways. I encountered 3 main formats, depending on whether the book in question is a physical book, an ebook, or an audiobook, though more may exist. Even within these, the formatting is inconsistent. For example, in most cases the best sellers ranking is contained within another list, which the webscraping code intentionally skips. (I decided to ignore the best sellers ranking because the list can be manipulated, for example by purchasing many copies of one's own book, and has been manipulated before; it is thus not a resilient metric to use.) However, in some cases it is stored as another entry instead of a table, which I cannot differentiate without checking its contents. This has led to the many columns that just contain a book's rank.
Additionally, from what I've noticed, there are 2 layouts that Amazon uses to display a book, as seen in the pictures.

Although seemingly inconsequential at first, the two layouts actually show different information. The top layout lets us access books by format and edition, which allows us to differentiate a book's popularity by format (although not over time, the way reviews allow us to). Moreover, which layout appears is not consistent, neither within the same book nor within the same book format. This has necessitated the Other_format column, where a True value means the book is presented in the bottom layout and vice versa.
During the analysis, I discovered that the method I used to collect data from books presented in the bottom layout (taking what I believed to be the global number of ratings and global average rating, then subtracting later on) was flawed, as some books show a different number of global ratings than others (see below).

This characteristic is not present in every book, and there is no way for me to verify its accuracy. Therefore, I have no choice but to remove all books presented in the other layout, as there is no way to tell whether this has occurred. Unfortunately, this may have significantly decreased the calculated popularity of the books in this dataset.
In the data cleaning, I address this inconsistency in format through the following steps:
lst = [-1,2999,5999,8999,11999,12999,13999,14999,17999,20999,21666]
df_list = []
for a in range(len(lst) - 1):
    df_info = pd.read_csv(f'Bookinfo data/{lst[a] + 1}_to_{lst[a+1]}.csv')
    df_info = df_info.iloc[:,1:]
    df_list.append(df_info)
df_bookinfo = pd.concat(df_list, axis=0)
#Removing columns with rankings
df_bookinfo.reset_index(drop=True,inplace=True) #reset index so setting rows by .loc works
column_list = []
for a in df_bookinfo.columns:
    if a.startswith(' '):
        column_list.append(a) #only columns with rankings start with ' '
#Combining columns that show the page number
df_bookinfo.drop(column_list, axis=1, inplace=True)
column_list = []
for column_name in df_bookinfo.columns:
    try:
        column_item = df_bookinfo[~df_bookinfo[column_name].isna()][[column_name]].iloc[0,0]
        if (column_item.split(' ')[-1] == 'pages'):
            column_list.append(column_name)
    except:
        pass
df_bookinfo['Num Page'] = df_bookinfo.progress_apply(combine_cols, axis=1)
df_bookinfo['Num Page'] = df_bookinfo['Num Page'].str.split(' ').str.get(0)
df_bookinfo['Num Page'] = df_bookinfo['Num Page'].astype('float64')
df_bookinfo.drop(column_list, axis=1, inplace=True)
#Splitting Publisher into Publisher and Publication date (for some)
df_bookinfo['Publication date'] = df_bookinfo.progress_apply(get_publication_date, axis=1)
column_list = ['Publication date', 'Audible.com Release Date']
df_bookinfo['Publication date'] = df_bookinfo.progress_apply(combine_cols, axis=1)
df_bookinfo['Publisher'] = df_bookinfo.loc[df_bookinfo['Publisher'].notna()]['Publisher'].str.split(';').str.get(0).str.split(' \(').str.get(0)
#Converting Publication date to datetime
d = {'January':'1', 'February':'2', 'March':'3', 'April':'4', 'May':'5', 'June':'6',
'July':'7', 'August':'8', 'September':'9', 'October':'10', 'November':'11', 'December':'12'}
month = df_bookinfo['Publication date'].str.strip().str.split(' ').str.get(-3).map(d)
day = df_bookinfo['Publication date'].str.strip().str.split(' ').str.get(-2).str[:-1] #remove the , at the end of day
year = df_bookinfo['Publication date'].str.strip().str.split(' ').str.get(-1)
df_bookinfo['Publication date'] = pd.to_datetime(month + '/' + day + '/' + year, infer_datetime_format=True, errors='coerce')
# Removing unreleased books
df_bookinfo = df_bookinfo.loc[df_bookinfo['Publication date'] < datetime.datetime(2022, 10, 1)]
#Columns have data in another column / irrelevant
contributors = ['Contributor 1 name', 'Contributor 1 role', 'Contributor 2 name', 'Contributor 2 role', 'Contributor 3 name', 'Contributor 3 role',
'Contributor 4 name', 'Contributor 4 role', 'Contributor 5 name', 'Contributor 5 role', 'Contributor 6 name', 'Contributor 6 role']
df_bookinfo.drop(['Author', 'Narrator', 'Program Type', 'Version', 'Narration accent', 'Audible.com Release Date', 'Language', 'Lexile measure',
'Grade level', 'Reading age', 'ASIN', 'Page numbers source ISBN', 'ISBN-10', 'ISBN-13', 'Dimensions', 'Simultaneous device usage']
+ contributors, axis=1, inplace=True)
#Combining X-Ray columns
column_list = ['X-Ray', 'X-Ray for textbooks']
df_bookinfo['X-Ray'] = df_bookinfo.progress_apply(combine_cols, axis=1)
#Converting type to boolean
column_list = ['Text-to-Speech', 'Enhanced typesetting', 'X-Ray', 'Word Wise', 'Lending']
for column_name in tqdm(column_list):
    a = column_name #clean_enabled reads the current column name from the global a
    df_bookinfo[column_name] = df_bookinfo.apply(clean_enabled, axis=1)
#Formatting price
df_bookinfo.loc[df_bookinfo['Price'] == '—', 'Price'] = np.nan #Signifies that the book is not available
df_bookinfo.loc[df_bookinfo['Price'] == 'Free App', 'Price'] = np.nan
df_bookinfo.loc[df_bookinfo['Price'] == '', 'Price'] = np.nan
df_bookinfo.loc[df_bookinfo['Price'] == 'from', 'Price'] = np.nan
df_bookinfo['Price'] = df_bookinfo['Price'].str[1:]
df_bookinfo['Price'] = df_bookinfo['Price'].str.replace(',', '')
df_bookinfo['Price'] = df_bookinfo['Price'].astype('float')
#Formatting filesize
df_bookinfo['File size'] = df_bookinfo['File size'].str.split(' ').str.get(0)
df_bookinfo['File size'] = df_bookinfo['File size'].astype('float')
#Formatting item weight and Listening length
df_bookinfo['Item Weight'] = df_bookinfo.progress_apply(reformat_weight, axis=1) #In ounces
df_bookinfo['Listening Length'] = df_bookinfo.progress_apply(reformat_time, axis=1) #In minutes
#Removing non-book formats and formats with less than 100 entries as there isn't enough to show something conclusive
#Coincidentally, all non-book formats appear less than 100 times
series = df_bookinfo['Format'].value_counts()
format_list = list(series[series > 100].index)
df_bookinfo = df_bookinfo.loc[df_bookinfo['Format'].isin(format_list)]
df_bookinfo['Format'] = df_bookinfo['Format'].str.strip()
#Formatting Whispersync for Voice
df_bookinfo['Whispersync for Voice'] = df_bookinfo.progress_apply(fix_ready, axis=1)
#Removing entries with no Book Url
df_bookinfo.dropna(subset=['Book Url'], axis=0, inplace=True)
df_bookinfo.reset_index(drop=True,inplace=True)
#Removing values with Other_format == True
df_bookinfo.dropna(subset=['Other_format'], axis=0, inplace=True)
df_bookinfo = df_bookinfo.loc[df_bookinfo['Other_format'] == False]
df_bookinfo.reset_index(drop=True,inplace=True)
df_bookinfo.drop('Other_format', axis=1, inplace=True)
#Creating extra useful columns for analysis
df_bookinfo['Popularity'] = df_bookinfo['Num_rating'] * df_bookinfo['Avg_rating']
def get_days_passed(row):
    return (datetime.datetime(2022, 10, 1) - row['Publication date']).days
df_bookinfo['Days after publishing'] = df_bookinfo.progress_apply(get_days_passed, axis=1)
physical_formats = ['Paperback', 'Hardcover', 'Mass Market Paperback', 'Spiral-bound', 'Library Binding'] #side-by-side boxplot / violin plot to compare between these
ebook_formats = ['Kindle', 'Kindle & Comixology']
audiobook_formats = ['Audiobook', 'Audio CD', 'MP3 CD']
def assign_main_format(row):
    if row.Format in physical_formats:
        return 'Physical'
    elif row.Format in ebook_formats:
        return 'Ebook'
    elif row.Format in audiobook_formats:
        return 'Audiobook'
    else:
        return np.nan
df_bookinfo['Main format'] = df_bookinfo.progress_apply(assign_main_format, axis=1)
df_bookinfo
| Title | Publisher | Publication date | File size | Text-to-Speech | Enhanced typesetting | X-Ray | Word Wise | Lending | Book Url | ... | Num_rating | Item Weight | Screen Reader | Listening Length | Whispersync for Voice | X-Ray for textbooks | Num Page | Popularity | Days after publishing | Main format | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manmade Wonders of the World | DK | 2019-10-08 | 262248.0 | False | False | False | False | False | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | ... | 28.0 | NaN | NaN | NaN | None | NaN | 415.0 | 131.6 | 1089 | Ebook |
| 1 | Man-Made Wonders of the World | DK | 2019-10-08 | NaN | None | None | None | None | None | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | ... | 417.0 | 83.20 | NaN | NaN | None | NaN | 416.0 | 2001.6 | 1089 | Physical |
| 2 | Architecture: A Visual History | DK | 2017-11-21 | NaN | None | None | None | None | None | /Architecture-Visual-History-Jonathan-Glancey/... | ... | 607.0 | 60.64 | NaN | NaN | None | NaN | 416.0 | 2886.1 | 1775 | Physical |
| 3 | The 99% Invisible City: A Field Guide to the H... | Dey Street Books | 2020-10-06 | NaN | None | None | None | None | None | /99-Invisible-City-Hidden-Everyday/dp/03581266... | ... | 2762.0 | 36.80 | NaN | NaN | None | NaN | 400.0 | 13257.6 | 725 | Physical |
| 4 | The 99% Invisible City: A Field Guide to the H... | Mariner Books | 2020-10-06 | 64102.0 | True | True | True | True | False | /99-Invisible-City-Hidden-Everyday/dp/03581266... | ... | 152.0 | NaN | Supported | NaN | None | NaN | 403.0 | 638.4 | 725 | Ebook |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 44775 | The Ultimate Washington Football Team Trivia B... | NaN | 2021-09-16 | 3681.0 | True | True | False | True | True | /Ultimate-Washington-Football-Team-Trivia/dp/1... | ... | 17.0 | NaN | Supported | NaN | None | NaN | 116.0 | 81.6 | 380 | Ebook |
| 44776 | The Ultimate Washington Capitals Trivia Book: ... | HRP House | 2020-10-02 | NaN | None | None | None | None | None | /Ultimate-Washington-Capitals-Trivia-Book/dp/1... | ... | 1164.0 | 7.50 | NaN | NaN | None | NaN | 153.0 | 5448.8 | 729 | Physical |
| 44777 | The Ultimate New York Yankees Trivia Book: A C... | NaN | 2020-08-19 | 3577.0 | True | True | False | True | True | /Ultimate-Washington-Capitals-Trivia-Book/dp/1... | ... | 53.0 | NaN | Supported | NaN | None | NaN | 163.0 | 235.7 | 773 | Ebook |
| 44778 | The Ultimate New York Yankees Trivia Book: A C... | HRP House | 2021-02-25 | NaN | None | None | None | None | None | /Ultimate-Washington-Capitals-Trivia-Book/dp/1... | ... | 0.0 | NaN | NaN | 263.0 | False | NaN | NaN | 0.0 | 583 | Audiobook |
| 44779 | Hey Kids! Let's Journal Washington DC: A Kids ... | Life Experiences Publishing | 2020-03-23 | NaN | None | None | None | None | None | /Hey-Kids-Lets-Journal-Washington/dp/194604908... | ... | 10.0 | 6.70 | NaN | NaN | None | NaN | 77.0 | 41.0 | 922 | Physical |
44780 rows × 23 columns
Next, I will categorise the remaining columns into the following categories
The columns themselves are mostly self-explanatory; the remaining ones are explained below.
Additionally, I have grouped all the book formats by their main format: physical book, ebook, and audiobook.
general_params = ['Title', 'Publisher', 'Publication date', 'Book Url', 'Price', 'Format', 'Avg_rating', 'Num_rating', 'Popularity', 'Days after publishing', 'Main format']
physical_params = ['Num Page', 'Item Weight']
audiobook_params = ['Listening Length', 'Whispersync for Voice']
ebook_params = ['File size', 'Text-to-Speech', 'Enhanced typesetting', 'X-Ray', 'Word Wise', 'Lending', 'Num Page']
physical_formats = ['Paperback', 'Hardcover', 'Mass Market Paperback', 'Spiral-bound', 'Library Binding']
ebook_formats = ['Kindle', 'Kindle & Comixology']
audiobook_formats = ['Audiobook', 'Audio CD', 'MP3 CD']
#df_bookinfo.columns
df_bookinfo['Whispersync for Voice'].value_counts()#.index.to_list()
#df_bookinfo[df_bookinfo['Whispersync for Voice'] == 'Ready'][['Format', 'Whispersync for Voice']]['Format']
#df_bookinfo.loc[df_bookinfo['Narration accent'].notna()]['Narration accent'].value_counts()
#df_bookinfo['Version'].value_counts()
#df_bookinfo['Program Type'].value_counts()
#df_bookinfo.loc[~df_bookinfo['Author'].isna()][['Author', 'Contributor 1 name', 'Contributor 1 role']]
True     4003
False    1885
Name: Whispersync for Voice, dtype: int64
In this dataset, I found that some international reviews did not have dates. While removing these reviews would leave international reviewers with less representation, the reviews with this issue were a small proportion of all reviews (<0.1%), so I removed them.
Additionally, I split the original 'Date' column into 'Date' and 'Country', which is more useful. I also bin the dates: they span from 1997 to 2022, and leaving them as is would lead to too many values on the x-axis and visualisation issues.
df_reviews.dropna(subset=["Date"], axis=0, inplace= True)
df_reviews.reset_index(drop=True, inplace=True)
df_reviews['Rating'] = df_reviews['Rating'].str.split(' ').str.get(0).astype('float')
def get_country(row):
    if row.Date is None:
        return None
    comps = row.Date.split(' ') #e.g. 'Reviewed in the United States on July 8, 2021'
    if comps[2] == 'the':
        if comps[3] == 'Netherlands':
            return comps[3]
        else:
            return comps[3] + ' ' + comps[4]
    else:
        return comps[2]
df_reviews['Country'] = df_reviews.progress_apply(get_country, axis=1)
d = {'January':'1', 'February':'2', 'March':'3', 'April':'4', 'May':'5', 'June':'6',
'July':'7', 'August':'8', 'September':'9', 'October':'10', 'November':'11', 'December':'12'}
month = df_reviews['Date'].str.strip().str.split(' ').str.get(-3).map(d)
day = df_reviews['Date'].str.strip().str.split(' ').str.get(-2).str[:-1] #remove the , at the end of day
year = df_reviews['Date'].str.strip().str.split(' ').str.get(-1)
df_reviews['Date'] = pd.to_datetime(month + '/' + day + '/' + year, infer_datetime_format=True)
#Bins dates by month, as the number of reviews changes greatly between days, making trends harder to see
def bin_date(row):
    date = row.Date
    return datetime.datetime(date.year, date.month, 1)
df_reviews['Date Bin'] = df_reviews.progress_apply(bin_date, axis=1)
df_reviews.head()
| Book URL | Num Ratings | Num Reviews | Name | Account ID | Rating | Date | Country | Date Bin | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Kindle Customer | AERSA5VTXDB2WRV2RVWZ2MUDVDVQ | 4.0 | 2022-09-24 | United States | 2022-09-01 |
| 1 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Marshal Mueller | AH7TCO74B46RIFI53NFOOFXFTLFA | 5.0 | 2022-02-06 | United States | 2022-02-01 |
| 2 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Hammad Iqbal | AGJIIPF56XA2AMEMXKYNPLYYRGQA | 5.0 | 2021-11-02 | United States | 2021-11-01 |
| 3 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Edith Leet | AEW56N2K52FVCR7GKSKI3AWV64RA | 5.0 | 2021-12-11 | United States | 2021-12-01 |
| 4 | /Manmade-Wonders-World-Dan-Cruickshank-ebook/d... | 418 | 36 | Kiki | AFDPHQHRSOCGY73XICFL6QGG2IVQ | 4.0 | 2021-07-08 | United States | 2021-07-01 |
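As an aside (a sketch, not the method used above): the month binning done with bin_date can also be vectorized, which avoids a row-wise apply over roughly 4.8 million reviews. Period round-tripping truncates each timestamp to the first day of its month.

```python
import pandas as pd

# Vectorized equivalent of bin_date: convert to a monthly Period and back,
# which snaps every timestamp to the first of its month.
dates = pd.Series(pd.to_datetime(['2022-09-24', '2021-11-02', '1997-06-15']))
date_bin = dates.dt.to_period('M').dt.to_timestamp()
print(date_bin.tolist())
```

On the real frame this would be `df_reviews['Date'].dt.to_period('M').dt.to_timestamp()`.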
df = df_bookinfo.copy()
df.dropna(subset=['Popularity', 'Price'], axis=0, inplace=True)
df = df.loc[(df['Price'] != 0) & (df['Price'] < 100)] #There are some books that have ridiculous prices which were probably listed wrong (a book was priced at $35,000)
y = df['Popularity']
x = df['Price']
plt.figure(figsize=(20,10))
plt.scatter(x,y, alpha=0.3)
#sns.scatterplot(x='Price', y='Popularity', data=df, alpha=0.3)#, scatter_kws={'alpha':0.2})#, alpha=0.2)
plt.title('Scatterplot of Popularity vs Price')
plt.xlabel('Price')
plt.ylabel('Popularity')
plt.ylim(0,700000)
plt.xlim(0,100)
plt.show()
plt.figure(figsize=(20,10))
sns.kdeplot(data=x)
plt.title('Kde plot of books by price')
plt.xlim(0,100)
plt.show()
As seen in the graph above, there isn't an obvious trend between price and popularity. However, popularity does seem somewhat associated with low prices, with a clump of higher-popularity data points at lower prices. Most books still have little to no popularity, resulting in a clear line of points at around popularity = 0.
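The visual impression of a weak relationship could be quantified with a rank correlation, which is robust to the heavy skew in both Price and Popularity. This is a sketch with toy data, not a result computed in the notebook; on the real frame it would be `df['Price'].corr(df['Popularity'], method='spearman')`.

```python
import numpy as np
import pandas as pd

# Toy data mimicking the shapes involved: roughly uniform prices and
# heavily skewed popularity, generated independently of each other.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Price': rng.uniform(1, 100, 500),
    'Popularity': rng.exponential(1000, 500),
})

# Spearman's rho uses ranks, so the skew in Popularity does not distort it.
rho = toy['Price'].corr(toy['Popularity'], method='spearman')
print(round(rho, 3))  # near 0 for independent columns
```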
df = df_bookinfo.copy()
df.dropna(subset=['Popularity', 'Days after publishing'], axis=0, inplace=True)
df = df.loc[df['Days after publishing'] < 10000] #roughly the number of days since Amazon was founded in 1994
y = df['Popularity']
x = df['Days after publishing']
plt.figure(figsize=(20,10))
plt.scatter(x,y, alpha=0.3)
plt.title('Scatterplot of Popularity vs Days after publishing, up to 10000 days')
plt.xlabel('Days after publishing')
plt.ylabel('Popularity')
df = df.loc[df['Days after publishing'] < 150]
y = df['Popularity']
x = df['Days after publishing']
plt.figure(figsize=(20,10))
plt.scatter(x,y, alpha=0.1)
plt.title('Scatterplot of Popularity vs Days after publishing, up to 150 days')
plt.xlabel('Days after publishing')
plt.ylabel('Popularity')
plt.show()
This result is unexpected: it shows that a book's popularity is roughly the same regardless of how long it has been available for purchase. This contradicts the intuition that books released earlier have had more time to be bought and reviewed, and should therefore have accumulated more popularity. Zooming into the early days after publication reveals another characteristic: many books are released specifically on Tuesdays, so the more popular books cluster on that day. Even zoomed in, however, there is no clear trend in a book's popularity based on how long ago it was published.
df = df_bookinfo.copy()
df.dropna(subset=['Popularity', 'Main format'], axis=0, inplace=True)
df['Main format'].value_counts()
plt.figure(figsize=(15,10))
sns.stripplot(y=df["Popularity"], x=df["Main format"], data=df, jitter = 0.25, alpha=0.4)
plt.title('Stripplot of Popularity by Main format')
plt.show()
From this graph, we can see that Audiobooks see much less popularity than their Ebook and Physical book counterparts. Let's break this down further by the spread within their individual formats.
Due to the presence of many outliers, whose removal would likely skew the data, we will instead measure the success of each format by the percentage of its books that exceed a popularity threshold, here set to 1000.
df_list = []
format_list = physical_formats + ebook_formats + audiobook_formats
for a in format_list:
    df_list.append(df.loc[df['Format'] == a])
nrow = 2
ncol = 5
fig = plt.figure(figsize=(20,10))
gs = fig.add_gridspec(nrow, ncol, hspace=0, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
count = 0
for r in range(nrow):
    for c in range(ncol):
        series = df_list[count]['Popularity'] > 1000
        temp = axes[r,c].pie(series.value_counts().sort_index(ascending=False).to_list(),
                             autopct='%1.1f%%')
        axes[r,c].set_title(format_list[count])
        axes[r,c].text(0, -1.2, f'Total: {len(series)} Books', ha='center', va='center')
        if r == 0 and c == 4:
            axes[r,c].legend(['Success', 'Yet to succeed'], loc='upper right')
        count += 1
plt.suptitle('Proportion of successful books by format', x=0.5, y=0.95, fontsize=18)
plt.show()
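The pie-chart percentages above can be cross-checked with a compact table: group by format and take the mean of the boolean success indicator. This is a sketch on toy data; `demo` and its values are made up for illustration, while the real call would pass the notebook's dataframe with its 'Format' and 'Popularity' columns.

```python
import pandas as pd

# Success rate per format: the share of books whose Popularity exceeds the
# threshold (1000, the same success cutoff used for the pie charts), in %.
def success_rates(df: pd.DataFrame, threshold: int = 1000) -> pd.Series:
    return (df.groupby('Format')['Popularity']
              .apply(lambda s: (s > threshold).mean() * 100)
              .sort_values(ascending=False))

# Toy data standing in for df_bookinfo
demo = pd.DataFrame({
    'Format': ['Paperback', 'Paperback', 'Audio CD', 'Kindle'],
    'Popularity': [1500, 200, 50, 3000],
})
print(success_rates(demo))
# Kindle 100.0, Paperback 50.0, Audio CD 0.0
```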
We can do a similar analysis for publishers as we did for formats. First, as there are over 5000 publishers in the dataset, let's narrow them down by the total popularity of their books; the sum is used because the most substantial publishers are those able to consistently publish books of high popularity. I will individually analyse the top 9 publishers and recategorise the rest as 'Other'.
publisher_pop = df_bookinfo.groupby('Publisher')['Popularity'].sum().sort_values(ascending=False).to_frame()
publisher_pop.head(9)
| Popularity | |
|---|---|
| Publisher | |
| Penguin Books | 3.511401e+06 |
| Independently published | 3.389138e+06 |
| Rockridge Press | 3.269160e+06 |
| Ballantine Books | 2.549849e+06 |
| Simon & Schuster | 2.418214e+06 |
| Crown | 2.404346e+06 |
| Random House Books for Young Readers | 2.184023e+06 |
| HarperOne | 2.121982e+06 |
| Bloomsbury Publishing | 2.057787e+06 |
df = df_bookinfo.copy()
top_9 = publisher_pop.index.to_list()[:9]
def differentiate_publisher(row):
    if row.Publisher not in top_9:
        return 'Other'
    else:
        return row.Publisher
df['Publisher'] = df.progress_apply(differentiate_publisher, axis=1)
plt.figure(figsize=(20,10))
sns.boxplot(y=df['Popularity'], x=df['Publisher'], order= top_9 + ['Other'], data=df)
plt.xticks(rotation=30)
plt.title('Side-by-side boxplots of Popularity by Publisher')
plt.show()
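The row-wise `progress_apply` above works, but on ~45,000 rows a vectorised `isin`/`where` performs the same recategorisation in one pass. A sketch on toy data; in the notebook the equivalent call would be `df['Publisher'].where(df['Publisher'].isin(top_9), 'Other')`.

```python
import pandas as pd

# Vectorised equivalent of differentiate_publisher: keep publishers that are
# in the top-9 list, replace everything else with 'Other'.
def collapse_publishers(publishers: pd.Series, keep: list) -> pd.Series:
    return publishers.where(publishers.isin(keep), 'Other')

demo = pd.Series(['Penguin Books', 'Tiny Press', 'Crown'])
print(collapse_publishers(demo, ['Penguin Books', 'Crown']).tolist())
# → ['Penguin Books', 'Other', 'Crown']
```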
To get a better grasp of this information, let's zoom into the distributions.
df = df_bookinfo.copy()
top_9 = publisher_pop.index.to_list()[:9]
def differentiate_publisher(row):
    if row.Publisher not in top_9:
        return 'Other'
    else:
        return row.Publisher
df['Publisher'] = df.progress_apply(differentiate_publisher, axis=1)
plt.figure(figsize=(20,10))
sns.boxplot(y=df['Popularity'], x=df['Publisher'], order= top_9 + ['Other'], data=df)
sns.stripplot(y=df['Popularity'], x=df['Publisher'], order= top_9 + ['Other'], data=df, color=".1", jitter=0.2, alpha=0.01)
df['Publisher'].value_counts()
plt.xticks(rotation=30)
plt.ylim(0,65000)
plt.title('Side-by-side boxplots of Popularity by Publisher (Zoomed in)')
plt.show()
I take item weight as a proxy for the physical quality of a book, since higher-quality materials should weigh more. This serves as a check on whether a book's physical quality affects its popularity.
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Physical'][general_params + physical_params]
df_format.dropna(subset=['Popularity', 'Item Weight'], axis=0, inplace=True)
df_format = df_format.loc[df_format['Item Weight'] < 500] #remove big outliers
#https://stackoverflow.com/questions/35042255/how-to-plot-multiple-seaborn-jointplot-in-subplot
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
sns.kdeplot(data=df_format, x = 'Item Weight', y = 'Popularity', ax=axes[0])
sns.scatterplot(data=df_format, x = 'Item Weight', y = 'Popularity', ax=axes[0], alpha=0.05)
sns.kdeplot(data=df_format, x = 'Item Weight', y = 'Popularity', ax=axes[1])
axes[0].set_title('Contour plot between Item Weight and Popularity')
axes[1].set_xlim([0, 100])
axes[1].set_ylim([0,48000])
axes[1].set_title('Contour plot between Item Weight and Popularity, zoomed in')
plt.show()
from scipy import stats #for pearsonr; not imported at the top of the notebook

pearson_coef, p_value = stats.pearsonr(df_format['Item Weight'], df_format['Popularity'])
print(f'pearson_coef: {pearson_coef}, p_value: {p_value}')
pearson_coef: 0.01632471039123131, p_value: 0.013557157575442232
df_format = df_bookinfo.loc[(df_bookinfo['Main format'] == 'Physical') | (df_bookinfo['Main format'] == 'Ebook')][general_params + ['Num Page']]
df_format.dropna(subset=['Num Page', 'Popularity'], axis=0, inplace=True)
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
sns.kdeplot(data=df_format, x = 'Num Page', y = 'Popularity', ax=axes[0])
sns.scatterplot(data=df_format, x = 'Num Page', y = 'Popularity', ax=axes[0], alpha=0.05)
sns.kdeplot(data=df_format, x = 'Num Page', y = 'Popularity', ax=axes[1])
axes[0].set_title('Contour plot and scatterplot between Num Page and Popularity')
axes[1].set_xlim([0, 1350])
axes[1].set_ylim([0,42000])
axes[1].set_title('Contour plot between Num Page and Popularity, zoomed in')
plt.show()
pearson_coef, p_value = stats.pearsonr(df_format['Num Page'], df_format['Popularity'])
print(f'pearson_coef: {pearson_coef}, p_value: {p_value}')
pearson_coef: 0.032989031136952236, p_value: 1.2893432096558919e-09
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Ebook'][general_params + ebook_params]
df_format.dropna(subset=['Popularity', 'File size'], axis=0, inplace=True)
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
sns.kdeplot(data=df_format, x = 'File size', y = 'Popularity', ax=axes[0])
sns.scatterplot(data=df_format, x = 'File size', y = 'Popularity', ax=axes[0], alpha=0.05)
sns.kdeplot(data=df_format, x = 'File size', y = 'Popularity', ax=axes[1])
axes[0].set_title('Contour plot and scatterplot between File size and Popularity')
axes[1].set_xlim([0, 500000])
axes[1].set_ylim([0, 60000])
axes[1].set_title('Contour plot between File size and Popularity, zoomed in')
plt.show()
pearson_coef, p_value = stats.pearsonr(df_format['File size'], df_format['Popularity'])
pearson_coef, p_value
(-0.04217707263255553, 4.006538504306089e-06)
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Audiobook'][general_params + audiobook_params]
df_format.dropna(subset=['Popularity', 'Listening Length'], axis=0, inplace=True)
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
sns.kdeplot(data=df_format, x = 'Listening Length', y = 'Popularity', ax=axes[0])
sns.scatterplot(data=df_format, x = 'Listening Length', y = 'Popularity', ax=axes[0], alpha=0.05)
sns.kdeplot(data=df_format, x = 'Listening Length', y = 'Popularity', ax=axes[1])
axes[0].set_title('Contour plot and scatterplot between Listening Length and Popularity')
axes[1].set_xlim([0, 2200])
axes[1].set_ylim([0, 1200])
axes[1].set_title('Contour plot between Listening Length and Popularity, zoomed in')
plt.show()
pearson_coef, p_value = stats.pearsonr(df_format['Listening Length'], df_format['Popularity'])
pearson_coef, p_value
(0.12227676451103892, 4.7191457386625465e-21)
As can be seen from these graphs and the Pearson coefficients, none of these parameters correlates strongly with popularity. However, for every parameter except file size, there is a range of values into which most books fall and where popularity is higher. Additionally, the p-values indicate moderate to strong certainty in these (weak) relationships.
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Ebook'][general_params + ebook_params]
accessibility_list = ['Text-to-Speech', 'Enhanced typesetting', 'X-Ray', 'Word Wise', 'Lending']
nrow = 1
ncol = 5
#Not zoomed in
fig = plt.figure(figsize=(20,10))
gs = fig.add_gridspec(nrow, ncol, hspace=0, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
count = 0
for r in range(nrow):
    for c in range(ncol):
        df = df_format.dropna(subset=['Popularity'] + [accessibility_list[c]], axis=0)
        sns.boxplot(y=df["Popularity"], x=df[accessibility_list[c]], ax=axes[c])
for ax in axes.flat:
    ax.label_outer()
plt.suptitle('Box plots of popularity by presence of accessibility features')
fig.supxlabel('Accessibility features')
plt.show()
#Zoomed in
fig = plt.figure(figsize=(20,10))
gs = fig.add_gridspec(nrow, ncol, hspace=0, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
count = 0
for r in range(nrow):
    for c in range(ncol):
        df = df_format.dropna(subset=['Popularity'] + [accessibility_list[c]], axis=0)
        sns.boxplot(y=df["Popularity"], x=df[accessibility_list[c]], ax=axes[c])
        axes[c].set_ylim([0,10000])
for ax in axes.flat:
    ax.label_outer()
plt.suptitle('Box plots of popularity by presence of accessibility features, zoomed in')
fig.supxlabel('Accessibility features')
plt.show()
As seen in the boxplots, for the most part, the availability of these accessibility features is correlated with an increase in the book's Popularity, having a higher median, Q1 and Q3 than if they were unavailable. This is with the exception of Lending where the opposite is true.
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Audiobook'][general_params + audiobook_params]
sns.boxplot(y=df_format["Popularity"], x=df_format['Whispersync for Voice'])
plt.title('Boxplot of Popularity by availability of Whispersync for Voice')
plt.show()
sns.boxplot(y=df_format["Popularity"], x=df_format['Whispersync for Voice'])
plt.ylim(0,330)
plt.title('Boxplot of Popularity by availability of Whispersync for Voice, zoomed in')
plt.show()
Similar to the accessibility features for Ebooks, the availability of Whispersync for Voice is correlated with an increase in the book's Popularity, having a higher median, Q1 and Q3 than when it is unavailable.
With reviews being the only metric that carries a date, they are the only way to connect ratings, and thus popularity, with time. Below I try to establish reviews as a good proxy for rating: first, each book's mean review rating is compared against its actual average rating (Popularity divided by the number of ratings).
df_ratingavg = df_reviews.groupby(['Book URL'])['Rating'].mean().to_frame().reset_index()
df_formatless_rating = df_bookinfo.groupby(['Book Url'])['Popularity'].sum()
df_formatless_num = df_bookinfo.groupby(['Book Url'])['Num_rating'].sum()
df_formatless_rating = (df_formatless_rating / df_formatless_num).to_frame().reset_index()
df_formatless_rating = pd.merge(df_formatless_rating, df_formatless_num.to_frame().reset_index(), on='Book Url')
def bin_num_rating(row):
    if row['Num_rating'] >= 100000:
        return '100000+'
    elif row['Num_rating'] >= 10000:
        return '10000+'
    elif row['Num_rating'] >= 1000:
        return '1000+'
    elif row['Num_rating'] >= 100:
        return '100+'
    elif row['Num_rating'] >= 10:
        return '10+'
    else:
        return '<10'
df_formatless_rating['Num_rating'] = df_formatless_rating.apply(bin_num_rating, axis=1)
df_compare = pd.merge(df_ratingavg, df_formatless_rating, left_on='Book URL', right_on='Book Url', how='inner')
df_compare.drop(['Book URL', 'Book Url'], axis=1, inplace=True)
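The bin_num_rating function above can also be expressed with `pd.cut`, which applies the same edges and labels in one vectorised call instead of a row-wise apply. A sketch on toy values:

```python
import pandas as pd

# Same bins as bin_num_rating: [0,10) → '<10', [10,100) → '10+', ...,
# [100000, inf) → '100000+'. right=False makes each bin closed on the left.
bins = [0, 10, 100, 1000, 10000, 100000, float('inf')]
labels = ['<10', '10+', '100+', '1000+', '10000+', '100000+']

demo = pd.Series([5, 10, 99, 100000, 250000])
print(pd.cut(demo, bins=bins, labels=labels, right=False).tolist())
# → ['<10', '10+', '10+', '100000+', '100000+']
```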
nrow = 2
ncol = 3
fig = plt.figure(figsize=(24,16))
gs = fig.add_gridspec(nrow, ncol, hspace=0.1, wspace=0.1)
axes = gs.subplots(sharex=True, sharey=True)
bin_num = ['<10', '10+', '100+', '1000+', '10000+', '100000+']
count = 0
for r in range(nrow):
    for c in range(ncol):
        df = df_compare.loc[df_compare['Num_rating'] == bin_num[count]]
        axes[r,c].plot([1,5], [1,5], color="red")
        axes[r,c].set_title(bin_num[count])
        sns.scatterplot(x=0, y='Rating', data=df, ax=axes[r,c]) #column 0 holds the actual average rating (Popularity / Num_rating)
        axes[r,c].set_xlabel('Actual value')
        axes[r,c].set_ylabel('Predicted value')
        count += 1
for ax in axes.flat:
    ax.label_outer()
plt.suptitle('Predicted vs Actual value of Avg_rating', x=0.5, y=0.95, fontsize=18)
plt.show()
As can be seen in the graphs above, reviews predict the rating rather well, especially for books with more than 100 ratings. However, the estimate tends to underestimate the actual average rating of the book. This is expected, as the proportion of people who leave a review differs by rating: people who give a bad rating are more likely to leave a review than people who give a good one (individually verified by checking a book's review page and sorting by star). Ideally, a model of this proportion could be built to better estimate the actual value from reviews; due to the lack of time, I will use this system instead.
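The underestimate just described could in principle be corrected by reweighting: if p(k) is the probability that a reader who rated k stars also leaves a review, each review should carry weight 1/p(k) in the average. The propensities below are made-up illustration values, not estimates from the data:

```python
import numpy as np

# Hypothetical review propensities per star level (NOT measured from the
# dataset): low raters are assumed more likely to leave a review.
propensity = {1: 0.30, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.15}

def corrected_mean(ratings):
    # Weight each review by the inverse of its star level's review propensity,
    # so under-represented (high-star) raters count for more.
    weights = np.array([1.0 / propensity[r] for r in ratings])
    return float(np.average(ratings, weights=weights))

reviews = [5, 5, 4, 1]                    # naive mean = 3.75
print(round(corrected_mean(reviews), 2))  # → 4.14, above the naive mean
```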
#Build a lookup from Book URL to [Num Ratings, Num Reviews]
map_ = (df_reviews.drop_duplicates(subset='Book URL')
                  .set_index('Book URL')[['Num Ratings', 'Num Reviews']]
                  .T.to_dict('list'))
df_ratingsum = df_reviews.groupby(["Book URL", "Date Bin"])["Rating"].sum().to_frame().reset_index()
def adjust_for_truncate(row):
    #Scraping was truncated at 5000 reviews per book, so for books with more
    #reviews, scale the scraped rating sum up proportionally
    num_review = map_[row['Book URL']][1] #the book's total number of reviews
    a = row.Rating
    if num_review > 5000:
        a = a * num_review / 5000
    return a
df_ratingsum['Rating'] = df_ratingsum.progress_apply(adjust_for_truncate, axis=1)
df_ratingsum = df_ratingsum.rename(columns={'Rating':'Estimated Popularity'})
df_ratingsum
| Book URL | Date Bin | Estimated Popularity | |
|---|---|---|---|
| 0 | /%C2%A1Gool-El-sueno-inicia-Spanish/dp/0307277... | 2020-09-01 | 5.0 |
| 1 | /%C2%A1Gool-El-sueno-inicia-Spanish/dp/0307277... | 2022-07-01 | 2.0 |
| 2 | /%C2%A1avancemos-Student-Level-2013-Spanish/dp... | 2014-11-01 | 15.0 |
| 3 | /%C2%A1avancemos-Student-Level-2013-Spanish/dp... | 2015-09-01 | 5.0 |
| 4 | /%C2%A1avancemos-Student-Level-2013-Spanish/dp... | 2016-07-01 | 5.0 |
| ... | ... | ... | ... |
| 737133 | /xchg-rax-xorpd/dp/1502958082/ref=sr_1_6?qid=1... | 2021-09-01 | 5.0 |
| 737134 | /xchg-rax-xorpd/dp/1502958082/ref=sr_1_6?qid=1... | 2021-10-01 | 5.0 |
| 737135 | /xchg-rax-xorpd/dp/1502958082/ref=sr_1_6?qid=1... | 2022-04-01 | 5.0 |
| 737136 | /xchg-rax-xorpd/dp/1502958082/ref=sr_1_6?qid=1... | 2022-06-01 | 4.0 |
| 737137 | /xchg-rax-xorpd/dp/1502958082/ref=sr_1_6?qid=1... | 2022-07-01 | 5.0 |
737138 rows × 3 columns
import datetime

df_withgenre = pd.merge(df_searched, df_ratingsum, on='Book URL', how='right')
df_withgenre = df_withgenre.groupby(['Main Genre', 'Date Bin'])[['Estimated Popularity']].sum()
df_withgenre.reset_index(inplace=True)
df_withgenre = df_withgenre.loc[df_withgenre['Date Bin'] < datetime.datetime(2022, 10, 1)] #Exclude October 2022 as the month has only just started
df_withgenre
| Main Genre | Date Bin | Estimated Popularity | |
|---|---|---|---|
| 0 | Arts & Photography | 1997-11-01 | 5.0 |
| 1 | Arts & Photography | 1998-08-01 | 10.0 |
| 2 | Arts & Photography | 1998-10-01 | 5.0 |
| 3 | Arts & Photography | 1998-12-01 | 10.0 |
| 4 | Arts & Photography | 1999-01-01 | 5.0 |
| ... | ... | ... | ... |
| 7688 | Travel | 2022-05-01 | 3749.0 |
| 7689 | Travel | 2022-06-01 | 3451.0 |
| 7690 | Travel | 2022-07-01 | 3986.0 |
| 7691 | Travel | 2022-08-01 | 4103.0 |
| 7692 | Travel | 2022-09-01 | 4157.0 |
7665 rows × 3 columns
df_ratingsum_alltime = df_reviews.groupby(["Book URL"])["Rating"].sum().to_frame().reset_index()
df_ratingsum_alltime['Rating'] = df_ratingsum_alltime.progress_apply(adjust_for_truncate, axis=1)
df_withgenre_alltime = pd.merge(df_searched, df_ratingsum_alltime, on='Book URL', how='right')
df_total = df_withgenre_alltime.groupby('Main Genre')[['Rating']].sum().sort_values(by='Rating', ascending=False)
df_top5_alltime = df_total.iloc[:5]
top5_list = list(df_top5_alltime.index)
top5_list
['Literature & Fiction', "Children's Books", 'Teen & Young Adult', 'Health, Fitness & Dieting', 'Biographies & Memoirs']
#Converting df_withgenre to a wide form that can be plotted as an area plot
df_temp = df_withgenre.copy()
date_bins = df_temp['Date Bin'].unique()
date_bins.sort()
def change_to_other(row):
    if row['Main Genre'] not in top5_list:
        return 'Other'
    else:
        return row['Main Genre']
df_temp['Main Genre'] = df_temp.progress_apply(change_to_other, axis=1)
df_temp = df_temp.groupby(['Main Genre', 'Date Bin'])[['Estimated Popularity']].sum()
df_temp.reset_index(inplace=True)
df_temp = df_temp.set_index('Date Bin')
#Build one wide row per date bin (a long-to-wide pivot)
df_list = []
for date_ in date_bins:
    a = df_temp.loc[date_]
    if isinstance(a['Main Genre'], str): #only one genre present in this date bin
        genres = [a['Main Genre']]
        pops = [a['Estimated Popularity']]
    else:
        genres = a['Main Genre'].to_list()
        pops = a['Estimated Popularity'].to_list()
    df_list.append(pd.DataFrame([pops], index=[date_], columns=genres))
df_area = pd.concat(df_list, axis=0)
df_area = df_area[['Teen & Young Adult', 'Biographies & Memoirs', 'Health, Fitness & Dieting', "Children's Books",
'Literature & Fiction', 'Other']]
df_area.plot(kind='area',
stacked=False,
figsize=(20, 10))
plt.xlabel('Year')
plt.ylabel('Estimated Popularity')
plt.title('Graph of Estimated Popularity of Top 5 Most Popular Genres over time, unstacked')
plt.show()
df_area.plot(kind='area',
stacked=True,
figsize=(20, 10))
plt.xlabel('Year')
plt.ylabel('Estimated Popularity')
plt.title('Graph of Estimated Popularity of Top 5 Most Popular Genres over time, stacked')
plt.show()
Above is the graph of estimated popularity over time. From it, we can see that swings in popularity tend to affect all genres at once, with peaks and dips in estimated popularity mostly occurring at around the same time for each genre. Additionally, the stacked area plot shows the total estimated popularity over time: a meteoric rise between 2009 and 2012, after which it fluctuates around a roughly constant value.
Next, we investigate the relative popularity of each genre over time, defined as the proportion of a genre's estimated popularity to the total estimated popularity of the same date bin.
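The per-row apply used below re-scans the totals frame for every row; the same per-bin normalisation can be done in one pass with `groupby().transform('sum')`. A sketch on toy data, with made-up genres and values:

```python
import pandas as pd

# Relative popularity = a row's Estimated Popularity divided by the total
# Estimated Popularity of its own Date Bin.
demo = pd.DataFrame({
    'Main Genre': ['Travel', 'History', 'Travel', 'History'],
    'Date Bin':   ['2022-08', '2022-08', '2022-09', '2022-09'],
    'Estimated Popularity': [30.0, 70.0, 40.0, 60.0],
})
totals = demo.groupby('Date Bin')['Estimated Popularity'].transform('sum')
demo['Relative Popularity'] = demo['Estimated Popularity'] / totals
print(demo['Relative Popularity'].tolist())
# → [0.3, 0.7, 0.4, 0.6]
```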
df_total = df_withgenre.groupby('Date Bin')[['Estimated Popularity']].sum().reset_index() #Total from every bin
def find_relative_popularity(row):
    total_rating = df_total.loc[df_total['Date Bin'] == row['Date Bin']].iloc[0]['Estimated Popularity']
    return row['Estimated Popularity'] / total_rating
df_list = []
for index_ in tqdm(index):
    df_indivgenre = df_withgenre.loc[df_withgenre['Main Genre'] == index_].copy()
    df_indivgenre['Relative Popularity'] = df_indivgenre.apply(find_relative_popularity, axis=1)
    df_list.append(df_indivgenre)
nrow = 6
ncol = 5
fig = plt.figure(figsize=(20,16))
gs = fig.add_gridspec(nrow, ncol, hspace=0, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
count = 0
for r in range(nrow):
    for c in range(ncol):
        df_list[count].plot(ax=axes[r,c], x='Date Bin', y='Relative Popularity')
        axes[r,c].text(datetime.datetime(2000, 1, 1), 0.8, index[count])
        axes[r,c].get_legend().remove()
        axes[r,c].set_xlabel('')
        count += 1
for ax in axes.flat:
    ax.label_outer()
fig.supxlabel('Date', x=0.5, y=0.15, fontsize=12)
fig.supylabel('Relative popularity', x=0.09, y=0.45, fontsize=12)
plt.suptitle('Relative popularity over time for every genre', x=0.5, y=0.9, fontsize=15)
plt.show()
Next, I look at the distribution of estimated popularity for each genre in the most recent complete month.
df_withgenre = pd.merge(df_searched, df_ratingsum, on='Book URL', how='right')
df_recent = df_withgenre.loc[df_withgenre['Date Bin'] == datetime.datetime(2022, 9, 1)] #September 2022 is the most recent complete month
df_recent.reset_index(drop=True, inplace=True)
plt.figure(figsize=(20,15))
sns.boxplot(x=df_recent['Estimated Popularity'], y=df_recent['Main Genre'], orient='h')
plt.title('Distribution of Estimated Popularity by genre')
plt.show()
plt.figure(figsize=(20,15))
sns.boxplot(x=df_recent['Estimated Popularity'], y=df_recent['Main Genre'], orient='h')
plt.xlim(0,250)
plt.title('Distribution of Estimated Popularity by genre, zoomed in')
plt.show()
df = df_bookinfo.copy()
df.dropna(subset=['Popularity', 'Price'], axis=0, inplace=True)
df = df.loc[(df['Price'] != 0) & (df['Price'] < 100)] #There are some books that have ridiculous prices which were probably listed wrong (a book was priced at $35,000)
y = df['Popularity']
x = df['Price']
plt.figure(figsize=(20,10))
plt.scatter(x,y, alpha=0.3)
plt.title('Scatterplot of Popularity vs Price')
plt.xlabel('Price')
plt.ylabel('Popularity')
plt.ylim(0,700000)
plt.xlim(0,100)
plt.show()
plt.figure(figsize=(20,10))
sns.kdeplot(data=x)
plt.title('Kde plot of books by price')
plt.xlim(0,100)
plt.show()
pearson_coef, p_value = stats.pearsonr(df['Price'], df['Popularity'])
print(f'pearson_coef: {pearson_coef}, p_value: {p_value}')
pearson_coef: -0.11044983316818657, p_value: 6.759750864066242e-99
As seen from the scatter plot and the Pearson correlation coefficient, there is no clear correlation between popularity and price. However, there appear to be two price ranges where popularity spikes: one between 0 and 2 dollars, and one between 6 and 14 dollars.
The latter spike could be explained by the KDE plot: a large proportion of the available books lie within this price range, so by sheer volume, books in this range are more likely to include popular ones, making the range appear more popular.
The spike between 0 and 2 dollars is more puzzling, as the KDE plot shows that the density of books in this range is low compared to the 2-6 dollar range, where there is a clear drop in popularity. The p-value is also very low at 6.7e-99, suggesting high certainty in the (weak) negative correlation.
I believe this difference is due to something else Amazon offers: subscription services. With Kindle Unlimited and Audible, users can access books at a steep discount, usually within this price range. Since books likely have to pass checks to qualify for the discount, it is plausible that already-popular books were given this price decrease to encourage new users to subscribe.
df = df_bookinfo.copy()
df.dropna(subset=['Popularity', 'Days after publishing'], axis=0, inplace=True)
df = df.loc[df['Days after publishing'] < 10000] #roughly the number of days since Amazon was founded in 1994
y = df['Popularity']
x = df['Days after publishing']
plt.figure(figsize=(20,10))
plt.scatter(x,y, alpha=0.3)
plt.title('No clear trend between popularity and days after publishing')
plt.xlabel('Days after publishing')
plt.ylabel('Popularity')
plt.show()
pearson_coef, p_value = stats.pearsonr(df['Days after publishing'], df['Popularity'])
df = df.loc[df['Days after publishing'] < 150]
y = df['Popularity']
x = df['Days after publishing']
plt.figure(figsize=(20,10))
plt.scatter(x,y, alpha=0.1)
plt.title('Spikes of books published on Tuesdays, which tend to be more popular')
plt.xlabel('Days after publishing')
plt.ylabel('Popularity')
plt.show()
print(f'pearson_coef: {pearson_coef}, p_value: {p_value}')
pearson_coef: 0.05509389985176973, p_value: 1.9357885945449015e-30
As seen from the scatterplot above and the Pearson correlation coefficient, there is again no clear correlation between Days after publishing and Popularity. This is counterintuitive: one would expect a book that has been published longer to have had more time to be bought and rated, increasing its popularity. It seems something has allowed newer books to pick up traction faster, and I believe that this is marketing, although that is impossible to verify with my datasets alone.
When zooming into the first few days after publication, I noticed another trend: more books, and more popular ones at that, are released on Tuesdays. Sure enough, this appears to have become an industry tradition, one which has been shown (albeit not conclusively) to boost a book's marketing. This supports my earlier hypothesis that newer books picking up steam more quickly is driven by marketing.
df = df_bookinfo.copy()
df.dropna(subset=['Popularity', 'Main format'], axis=0, inplace=True)
df_list = []
format_list = physical_formats + ebook_formats + audiobook_formats
for a in format_list:
    df_list.append(df.loc[df['Format'] == a])
df_format_name = ['Physical'] * len(physical_formats) + (['Ebook'] * len(ebook_formats)) + (['Audiobook'] * len(audiobook_formats))
nrow = 2
ncol = 5
fig = plt.figure(figsize=(20,10))
gs = fig.add_gridspec(nrow, ncol, hspace=0, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
count = 0
for r in range(nrow):
    for c in range(ncol):
        series = df_list[count]['Popularity'] > 1000
        temp = axes[r,c].pie(series.value_counts().sort_index(ascending=False).to_list(),
                             autopct='%1.1f%%')
        axes[r,c].set_title(df_format_name[count] + ': ' + format_list[count])
        axes[r,c].text(0, -1.2, f'Total: {len(series)} Books', ha='center', va='center')
        if r == 0 and c == 4:
            axes[r,c].legend(['Success', 'Yet to succeed'], loc='upper right')
        count += 1
plt.suptitle('Proportion of successful books by format', x=0.5, y=0.95, fontsize=18)
plt.show()
Note again that a successful book is defined as a book whose popularity is greater than 1000.
As seen from the graph, the formats with the most successful books remain the traditional ones: the physical Paperback, Hardcover and Mass Market Paperback. More niche physical formats such as Spiral-bound and Library Binding perform worse.
Both ebook formats in the dataset, Kindle and Kindle & Comixology, were also found to be rather successful.
Audiobooks, however, perform far worse than their physical and ebook counterparts, with each format scoring below 3% despite having a comparable number of books.
This suggests that audiobooks and their formats have still not been widely accepted by society.
df = df_bookinfo.copy()
publisher_pop = df_bookinfo.groupby('Publisher')['Popularity'].sum().sort_values(ascending=False).to_frame()
top_9 = publisher_pop.index.to_list()[:9]
def differentiate_publisher(row):
    if row.Publisher not in top_9:
        return 'Other'
    else:
        return row.Publisher
df['Publisher'] = df.progress_apply(differentiate_publisher, axis=1)
plt.figure(figsize=(20,10))
sns.boxplot(y=df['Popularity'], x=df['Publisher'], order= top_9 + ['Other'], data=df)
df['Publisher'].value_counts()
plt.xticks(rotation=30)
plt.ylim(0,65000)
plt.title('Established publishers are publishing more popular books')
plt.show()
Publishers mainly serve the purpose of spreading a book: marketing it and getting it onto bookshelves. One would therefore expect established publishers to publish books with higher popularity, and this is what the boxplots above show: the more established publishers have a higher median, Q1 and Q3 than other publishers.
Additionally, we observe that 'Other' publishers have a higher Q3 than independently published books. This is also expected, as publishers have more resources and experience with which to market a book and help it gain popularity.
from scipy import stats

#Item weight
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Physical'][general_params + physical_params]
df_format.dropna(subset=['Popularity', 'Item Weight'], axis=0, inplace=True)
df_format = df_format.loc[df_format['Item Weight'] < 500]
fig, axes = plt.subplots(2, 2, figsize=(20, 20))
sns.kdeplot(data=df_format, x = 'Item Weight', y = 'Popularity', ax=axes[0,0])
axes[0,0].set_xlim([0, 100])
axes[0,0].set_ylim([0,48000])
axes[0,0].set_title('Contour plot between Item Weight and Popularity')
pearson_coef, p_value = stats.pearsonr(df_format['Item Weight'], df_format['Popularity'])
axes[0,0].text(40, 40000, f'pearson_coef: {pearson_coef},\np_value: {p_value}')
axes[0,0].plot([8,8], [1,40000], color="red", linestyle='--')
axes[0,0].plot([15,15], [1,40000], color="red", linestyle='--')
#Num page
df_format = df_bookinfo.loc[(df_bookinfo['Main format'] == 'Physical') | (df_bookinfo['Main format'] == 'Ebook')][general_params + ['Num Page']]
df_format.dropna(subset=['Num Page', 'Popularity'], axis=0, inplace=True)
sns.kdeplot(data=df_format, x = 'Num Page', y = 'Popularity', ax=axes[0,1])
axes[0,1].set_xlim([0, 1350])
axes[0,1].set_ylim([0,42000])
axes[0,1].set_title('Contour plot between Num Page and Popularity')
axes[0,1].plot([100,130], [1,30000], color="red", linestyle='--')
axes[0,1].plot([350,380], [1,30000], color="red", linestyle='--')
pearson_coef, p_value = stats.pearsonr(df_format['Num Page'], df_format['Popularity'])
axes[0,1].text(600, 30000, f'pearson_coef: {pearson_coef},\np_value: {p_value}')
#File size
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Ebook'][general_params + ebook_params]
df_format.dropna(subset=['Popularity', 'File size'], axis=0, inplace=True)
sns.kdeplot(data=df_format, x = 'File size', y = 'Popularity', ax=axes[1,0])
axes[1,0].set_xlim([0, 500000])
axes[1,0].set_ylim([0, 60000])
axes[1,0].set_title('Contour plot between File size and Popularity')
pearson_coef, p_value = stats.pearsonr(df_format['File size'], df_format['Popularity'])
axes[1,0].text(150000, 50000, f'pearson_coef: {pearson_coef},\np_value: {p_value}')
#Listening Length
df_format = df_bookinfo.loc[df_bookinfo['Main format'] == 'Audiobook'][general_params + audiobook_params]
df_format.dropna(subset=['Popularity', 'Listening Length'], axis=0, inplace=True)
sns.kdeplot(data=df_format, x = 'Listening Length', y = 'Popularity', ax=axes[1,1])
axes[1,1].set_xlim([0, 2200])
axes[1,1].set_ylim([0, 1200])
axes[1,1].set_title('Contour plot between Listening Length and Popularity, zoomed in')
pearson_coef, p_value = stats.pearsonr(df_format['Listening Length'], df_format['Popularity'])
axes[1,1].text(1000, 1000, f'pearson_coef: {pearson_coef},\np_value: {p_value}')
axes[1,1].plot([250,425], [1,900], color="red", linestyle='--')
axes[1,1].plot([625,800], [1,900], color="red", linestyle='--')
fig.show()
As seen from the Pearson r coefficients of the plots, there is no clear linear correlation between these parameters and popularity. However, upon inspecting the graphs, each parameter except File size shows a range (marked with red dashed lines) within which books tend to have higher popularity. While this may simply be because there are more books within these standard ranges, I do not think that explanation is satisfactory, as the contour lines within these bounds extend farther out than the surrounding contour lines. This suggests that some other, unknown factor contributes to the observed behaviour. The low p-values of the plots, with the exception of Item Weight, indicate that this behaviour is fairly consistent.
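The combination of a weak Pearson r and a near-zero p-value is exactly what one should expect at this sample size: with tens of thousands of books, even a tiny correlation is statistically detectable. A quick illustration with simulated data (purely a sanity check on the statistics, not drawn from the dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40_000  # on the order of the number of books analysed
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)  # deliberately weak relationship

# r comes out small, yet p is effectively zero because n is large
r, p = stats.pearsonr(x, y)
print(r, p)
```

This is why a small p-value here should be read as "the weak association is consistent", not as evidence of a strong relationship.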
df_format = df_bookinfo.loc[(df_bookinfo['Main format'] == 'Ebook') | (df_bookinfo['Main format'] == 'Audiobook')][general_params + ebook_params + audiobook_params]
accessibility_list = ['Text-to-Speech', 'Enhanced typesetting', 'X-Ray', 'Word Wise', 'Lending', 'Whispersync for Voice']
fig = plt.figure(figsize=(20,10))
gs = fig.add_gridspec(1, len(accessibility_list), hspace=0, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
# One box plot per accessibility feature, comparing popularity with and without it
for c, feature in enumerate(accessibility_list):
    df = df_format.dropna(subset=['Popularity', feature], axis=0)
    sns.boxplot(y=df['Popularity'], x=df[feature], ax=axes[c])
    axes[c].set_ylim([0,10000])
for ax in axes.flat:
    ax.label_outer()
plt.suptitle('Box plots of popularity by presence of accessibility features')
fig.supxlabel('Accessibility features')
plt.show()
As seen in the graphs above, the availability of accessibility features correlates with an increase in popularity, with the exception of Lending. I believe this is not because readers see lending as a feature to avoid, but because lending is disabled for more popular books to encourage more people to buy them. Similarly, rather than these books attracting readers through their accessibility features, to a certain extent the causation could run the other way.
Before implementing these features to attract readers, an author or publisher should first estimate the expected increase in popularity such a change would bring, and compare it with the cost of implementing the feature to determine whether doing so would be profitable.
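One way to estimate that expected increase is the difference in median popularity between books with and without a feature, in the spirit of the box plots above. A sketch on hypothetical numbers (the feature name, rows, and figures are all made up for illustration, not measured from this data):

```python
import pandas as pd

# Hypothetical sample: popularity of ebooks with and without one feature (e.g. Word Wise).
df = pd.DataFrame({
    'Word Wise': [True, True, True, False, False, False],
    'Popularity': [3000, 2400, 1800, 900, 700, 500],
})

# Estimated uplift = difference in median popularity between the two groups
medians = df.groupby('Word Wise')['Popularity'].median()
uplift = medians[True] - medians[False]
print(uplift)
```

The uplift (here 1700 on toy numbers) could then be translated into expected extra sales and weighed against the feature's implementation cost, with the caveat noted above that part of the association may be reverse causation.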
#df_rating sum
df_ratingsum = df_reviews.groupby(["Book URL", "Date Bin"])["Rating"].sum().to_frame().reset_index()
# Only up to 5000 reviews per book could be scraped, so for books with more
# reviews, scale the truncated rating sum up proportionally
def adjust_for_truncate(row):
    num_review = map_[row['Book URL']][1] #get the number of reviews
    a = row.Rating
    if num_review > 5000:
        a = a * num_review / 5000
    return a
df_ratingsum['Rating'] = df_ratingsum.progress_apply(adjust_for_truncate, axis=1)
df_ratingsum = df_ratingsum.rename(columns={'Rating':'Estimated Popularity'})
df_ratingsum.sort_values(by="Estimated Popularity",ascending=False) #plot the sums of the ratings against the ratings of the book
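The truncation adjustment above can be restated as a standalone helper to make the scaling explicit; this is an illustrative restatement of `adjust_for_truncate`, with the 5000-review cap taken from the code above:

```python
# Rating sums are computed from at most 5000 scraped reviews per book, so a book
# with more reviews has its sum scaled up by the fraction that was missed.
def adjust_for_truncate(rating_sum, num_reviews, cap=5000):
    if num_reviews > cap:
        return rating_sum * num_reviews / cap
    return rating_sum

print(adjust_for_truncate(20000, 10000))  # book with 2x the cap: sum doubled to 40000.0
print(adjust_for_truncate(1500, 400))     # below the cap: left unchanged at 1500
```

This assumes the unseen reviews have the same average rating as the scraped ones, which is why the result is labelled Estimated Popularity.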
#df_withgenre
df_withgenre = pd.merge(df_searched, df_ratingsum, on='Book URL', how='right')
df_withgenre = df_withgenre.groupby(['Main Genre', 'Date Bin'])[['Estimated Popularity']].sum()
df_withgenre.reset_index(inplace=True)
df_withgenre = df_withgenre.loc[(df_withgenre['Date Bin'] < datetime.datetime(2022, 10, 1))] #Remove 2022 October as a bin as the month just started
#all time
df_ratingsum_alltime = df_reviews.groupby(["Book URL"])["Rating"].sum().to_frame().reset_index()
df_ratingsum_alltime['Rating'] = df_ratingsum_alltime.progress_apply(adjust_for_truncate, axis=1)
df_withgenre_alltime = pd.merge(df_searched, df_ratingsum_alltime, on='Book URL', how='right')
df_total = df_withgenre_alltime.groupby('Main Genre')[['Rating']].sum().sort_values(by='Rating', ascending=False)
df_top5_alltime = df_total.iloc[:5]
top5_list = list(df_top5_alltime.index)
#Converting df_withgenre to a form that allows us to plot area plot
df_temp = df_withgenre.copy()
test = df_temp['Date Bin'].unique()
test.sort()
# Group every genre outside the all-time top five under 'Other'
def change_to_other(row):
    if row['Main Genre'] not in top5_list:
        return 'Other'
    return row['Main Genre']
df_temp['Main Genre'] = df_temp.progress_apply(change_to_other, axis=1)
df_temp = df_temp.groupby(['Main Genre', 'Date Bin'])[['Estimated Popularity']].sum() #Rename this to estimated popularity
df_temp.reset_index(inplace=True)
df_temp = df_temp.set_index('Date Bin')
# Rebuild the data as one row per date bin with a column per genre, for area plotting
df_list = []
for date_ in test:
    a = df_temp.loc[date_]
    if isinstance(a['Main Genre'], str): #only one genre present in this bin
        genres = [a['Main Genre']]
        popularities = [a['Estimated Popularity']]
    else:
        genres = a['Main Genre'].to_list()
        popularities = a['Estimated Popularity'].to_list()
    df_list.append(pd.DataFrame([popularities], index=[date_], columns=genres))
df_area = pd.concat(df_list, axis=0)
df_area = df_area[['Teen & Young Adult', 'Biographies & Memoirs', 'Health, Fitness & Dieting', "Children's Books",
'Literature & Fiction', 'Other']]
df_area.plot(kind='area',
stacked=False,
figsize=(20, 10))
plt.xlabel('Year')
plt.ylabel('Estimated Popularity')
plt.title('Graph of Estimated Popularity of Top 5 Most Popular Genres over time, unstacked')
plt.show()
df_area.plot(kind='area',
stacked=True,
figsize=(20, 10))
plt.xlabel('Year')
plt.ylabel('Estimated Popularity')
plt.title('Graph of Estimated Popularity of Top 5 Most Popular Genres over time, stacked')
plt.annotate('',xy=(datetime.datetime(2009,1,1), 30000), xytext=(datetime.datetime(1999, 1, 1),30000),
arrowprops=dict(arrowstyle='->', color='blue', lw=2))
plt.text(datetime.datetime(2004,1,1), 35000, 'Relatively constant', va='center', ha='center')
plt.annotate('',xy=(datetime.datetime(2013,1,1), 260000), xytext=(datetime.datetime(2011, 1, 1),30000),
arrowprops=dict(arrowstyle='->', color='green', lw=2))
plt.text(datetime.datetime(2011,11,1), 152000, 'Steep increase', va='center', ha='center', rotation=74)
plt.annotate('',xy=(datetime.datetime(2022,1,1), 300000), xytext=(datetime.datetime(2014, 1, 1),300000),
arrowprops=dict(arrowstyle='->', color='blue', lw=2))
plt.text(datetime.datetime(2018,1,1), 320000, 'Fluctuates about relatively constant point', va='center', ha='center')
plt.show()
From the above graphs, we see how the popularities of book genres relate to one another. The unstacked graph shows changes affecting books as a whole, with all genres experiencing the same peaks and dips. The stacked graph shows a large increase in estimated popularity from 2011 to 2013, during which the popularity of all genres increased to varying degrees; the total then stabilised, although the proportion held by each genre continued to shift from 2014 through 2022.
df_total = df_withgenre.groupby('Date Bin')[['Estimated Popularity']].sum().reset_index() #Total from every bin
# Relative popularity = a genre's estimated popularity as a fraction of its bin's total
def find_relative_popularity(row):
    total_rating = df_total.loc[df_total['Date Bin'] == row['Date Bin']].iloc[0]['Estimated Popularity']
    return row['Estimated Popularity'] / total_rating
df_list = []
for index_ in tqdm(index):
    df_indivgenre = df_withgenre.loc[df_withgenre['Main Genre'] == index_].copy()
    df_indivgenre['Relative Popularity'] = df_indivgenre.apply(find_relative_popularity, axis=1)
    df_list.append(df_indivgenre)
nrow = 6
ncol = 5
fig = plt.figure(figsize=(20,16))
gs = fig.add_gridspec(nrow, ncol, hspace=0, wspace=0)
axes = gs.subplots(sharex=True, sharey=True)
count = 0
# One small subplot per genre, with a red dashed line at the 10% boundary
for r in range(nrow):
    for c in range(ncol):
        df_list[count].plot(ax=axes[r,c], x='Date Bin', y='Relative Popularity')
        axes[r,c].text(datetime.datetime(2000, 1, 1), 0.8, index[count])
        axes[r,c].get_legend().remove()
        axes[r,c].set_xlabel('')
        axes[r,c].plot([datetime.datetime(1999,1,1), datetime.datetime(2022,10,1)], [0.1,0.1], color="red", linestyle='--')
        count += 1
for ax in axes.flat:
    ax.label_outer()
fig.supxlabel('Date', x=0.5, y=0.15, fontsize=12)
fig.supylabel('Relative popularity', x=0.09, y=0.45, fontsize=12)
plt.suptitle('Relative popularity over time for every genre', x=0.5, y=0.9, fontsize=15)
plt.show()
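The per-bin normalisation used above (a row-wise `apply` with a lookup into the bin totals) can also be expressed with `groupby().transform`, which avoids the repeated lookups; a sketch on toy data (the genres and numbers here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Main Genre': ['Romance', 'Sci-Fi', 'Romance', 'Sci-Fi'],
    'Date Bin': ['2021-01', '2021-01', '2021-02', '2021-02'],
    'Estimated Popularity': [300, 100, 150, 350],
})

# Relative popularity = each genre's share of its date bin's total popularity
df['Relative Popularity'] = (df['Estimated Popularity']
                             / df.groupby('Date Bin')['Estimated Popularity'].transform('sum'))
print(df['Relative Popularity'].tolist())
```

Within each bin the shares sum to 1, so plotting a genre's series against the 10% line reads directly as its share of that month's total.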
From the above plots, we see that Literature & Fiction and Children's Books have consistently had a relative popularity greater than 10% and are the most popular genres today. Besides them, many other genres have crossed the 10% boundary before and were likely among the top genres at the time. Today, most of these genres have either stabilised far below the boundary or are hovering slightly below it. The genres that appear to have the potential to become a top genre for the first time, or to become one again, are:
Arts & Photography, Biographies & Memoirs, Business & Money, Health, Fitness & Dieting, Mystery, Thriller & Suspense, Religion & Spirituality, Romance, Science & Math, Science Fiction & Fantasy, and Teen & Young Adult.
To answer this question, we examine the distribution of Estimated Popularity by genre over the past month.
df_withgenre = pd.merge(df_searched, df_ratingsum, on='Book URL', how='right')
df_recent = df_withgenre.loc[(df_withgenre['Date Bin'] == datetime.datetime(2022, 9, 1))] #Use 2022 September, the most recent complete month
df_recent.reset_index(drop=True, inplace=True)
plt.figure(figsize=(20,15))
sns.boxplot(x=df_recent['Estimated Popularity'], y=df_recent['Main Genre'], orient='h')
plt.xlim(0,250)
plt.title('Distribution of Estimated Popularity by genre in the past month')
plt.show()
From the graph above, we see that Mystery, Thriller & Suspense, Romance, and Science Fiction & Fantasy have the highest medians of estimated popularity, indicating that a book in these genres is most likely to achieve a higher estimated popularity.
Recommendations for authors (although much depends on the book you are writing): write in the genres Mystery, Thriller & Suspense, Romance, or Science Fiction & Fantasy; price the book at around 6 to 14 dollars; publish it on a Tuesday; publish in a hardcover, paperback or ebook format, avoiding audiobooks; aim for around 100 to 350 pages; have physical copies weigh around 8 to 15 ounces; and include accessibility features unless deemed unprofitable.
Improvements / Further Work: This project largely became a project centred on web scraping. With Amazon imposing rate limits, combined with the vast amount of data available and the lack of unbiased sampling methods, the data collected and investigated, though large in its own right, may be far from representative of the actual site.
Therefore, for future work, time permitting, I would like to investigate the books on the site more deeply, to surface the many books out there that might be gems but would otherwise remain obscured from the public eye.
Additionally, I would have liked to explore the datasets in more depth, such as a user analysis of how users behave on the site (which genres repeat reviewers frequent, what their reviewing patterns are, etc.).
https://www.quora.com/How-many-books-does-Amazon-have-for-sale for the information on the number of books Amazon had available
https://bettermarketing.pub/why-are-books-released-on-tuesdays-d20a3f98f467 for the information on publishing on Tuesdays
Contains webscraping notebook and datasets.